{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Machine Learning at Scale, Part II"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For this tutorial, we'll dig deeper into BIDMach's learning architecture. The examples so far have use convenience functions which assembled together a Data Source, Learner, Model, Updater and Mixin classes to make a trainable model. This time we'll separate out those components and see how they can be customized. \n",
    "\n",
    "The dataset is from UCI and comprises Pubmed abstracts. It is about 7.3GB in text form. We'll compute an LDA topic model for this dataset. \n",
    "\n",
    "First lets initialize BIDMach again."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1 CUDA device found, CUDA version 7.0\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "(0.99132067,11974557696,12079398912)"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import BIDMat.{CMat,CSMat,DMat,Dict,IDict,Image,FMat,FND,GDMat,GMat,GIMat,GSDMat,GSMat,\n",
    "               HMat,IMat,Mat,SMat,SBMat,SDMat}\n",
    "import BIDMat.MatFunctions._\n",
    "import BIDMat.SciFunctions._\n",
    "import BIDMat.Solvers._\n",
    "import BIDMat.JPlotting._\n",
    "import BIDMach.Learner\n",
    "import BIDMach.models.{FM,GLM,KMeans,KMeansw,ICA,LDA,LDAgibbs,Model,NMF,RandomForest,SFA,SVD}\n",
    "import BIDMach.datasources.{DataSource,MatSource,FileSource,SFileSource}\n",
    "import BIDMach.mixins.{CosineSim,Perplexity,Top,L1Regularizer,L2Regularizer}\n",
    "import BIDMach.updaters.{ADAGrad,Batch,BatchNorm,IncMult,IncNorm,Telescoping}\n",
    "import BIDMach.causal.{IPTW}\n",
    "\n",
    "Mat.checkMKL\n",
    "Mat.checkCUDA\n",
    "Mat.setInline\n",
    "if (Mat.hasCUDA > 0) GPUmem"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Check the GPU memory again, and make sure you dont have any dangling processes."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Large-scale Topic Models"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A **Topic model** is a representation of a Bag-Of-Words corpus as several factors or topics. Each topic should represent a theme that recurs in the corpus. Concretely, the output of the topic model will be an (ntopics x nfeatures) matrix we will call <code>tmodel</code>. Each row of that matrix represents a topic, and the elements of that row are word probabilities for the topic (i.e. the rows sum to 1). There is more about topic models <a href=\"http://en.wikipedia.org/wiki/Topic_model\">here on wikipedia</a>.\n",
    "\n",
    "The **element <code>tmodel(i,j)</code> holds the probability that word j belongs to topic i**. Later we will examine the topics directly and try to make sense of them.\n",
    "\n",
    "Lets construct a learner with a files data source. Most model classes will accept a String argument, and assume it is a pattern for accessing a collection of files. To create the learner, we pass this pattern (which will be invoked with <string> format i) to enumerate one filename. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": []
    },
    {
     "data": {
      "text/plain": [
       "BIDMach.models.LDA$FileOpts@4f6fbde5"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "val mdir = \"../data/uci/pubmed_parts/\";\n",
    "val (nn, opts) = LDA.learner(mdir+\"part%02d.smat.lz4\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that this dataset is quite large, and isnt one of the ones loaded by <code>getdata.sh</code> in the <code>scripts</code> directory. You need to run the script <code>getpubmed.sh</code> separately (and plan a long walk or bike ride while you wait...). \n",
    "\n",
    "This datasource uses just this sequence of files, and each matrix has 141043 rows. A number of options are listed below that control the files datasource. Most of these dont need to be set (you'll notice they're just set to their default values), but its useful to know about them for customizing data sources. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": []
    },
    {
     "data": {
      "text/plain": [
       "300"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "opts.nstart = 0;                 // Starting file number\n",
    "opts.nend = 10;                  // Ending file number\n",
    "opts.order = 0;                  // (0) sample order, 0=linear, 1=random\n",
    "opts.lookahead = 2;              // (2) number of prefetch threads\n",
    "opts.featType = 1;               // (1) feature type, 0=binary, 1=linear\n",
    "// These are specific to SfilesDS:\n",
    "opts.fcounts = icol(141043);     // how many rows to pull from each input matrix \n",
    "opts.eltsPerSample = 300         // how many rows to allocate (non-zeros per sample)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We're ready to go. LDA is a popular topic model, described <a href=\"http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation\">here on wikipedia</a>.\n",
    "\n",
    "We use a fast version of LDA which uses an incremental multiplicative update described by Hoffman, Blei and Bach \n",
    "<a href=\"https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf\">here</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Tuning Options"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Add tuning options for minibatch size (say 100k), number of passes (4) and dimension (<code>dim = 256</code>). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": []
    },
    {
     "data": {
      "text/plain": [
       "256"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "opts.batchSize=50000\n",
    "opts.npasses=2\n",
    "opts.dim=256"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You invoke the learner the same way as before. You can change the options above after each run to optimize performance. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "corpus perplexity=9554.363735\n",
      "pass= 0\n",
      " 1.00%, ll=-8.62910, gf=19.682, secs=1.6, GB=0.04, MB/s=26.00, GPUmem=0.915805\n",
      " 7.00%, ll=-7.21284, gf=30.997, secs=6.8, GB=0.27, MB/s=39.59, GPUmem=0.915805\n",
      "14.00%, ll=-7.07065, gf=31.161, secs=12.5, GB=0.50, MB/s=39.69, GPUmem=0.915805\n",
      "21.00%, ll=-6.99135, gf=31.853, secs=17.9, GB=0.72, MB/s=40.53, GPUmem=0.915805\n",
      "28.00%, ll=-6.99923, gf=32.446, secs=23.1, GB=0.95, MB/s=41.27, GPUmem=0.915805\n",
      "34.00%, ll=-6.97575, gf=32.613, secs=28.4, GB=1.18, MB/s=41.46, GPUmem=0.915805\n",
      "41.00%, ll=-6.91192, gf=32.848, secs=33.7, GB=1.41, MB/s=41.76, GPUmem=0.915805\n",
      "48.00%, ll=-6.95156, gf=32.909, secs=39.1, GB=1.64, MB/s=41.83, GPUmem=0.915805\n",
      "54.00%, ll=-6.95275, gf=32.837, secs=44.6, GB=1.86, MB/s=41.73, GPUmem=0.915805\n",
      "61.00%, ll=-6.88738, gf=32.913, secs=50.0, GB=2.09, MB/s=41.82, GPUmem=0.915805\n",
      "68.00%, ll=-6.93559, gf=33.058, secs=55.2, GB=2.32, MB/s=42.00, GPUmem=0.915805\n",
      "75.00%, ll=-6.92956, gf=33.061, secs=60.6, GB=2.55, MB/s=42.00, GPUmem=0.915805\n",
      "81.00%, ll=-6.88030, gf=33.178, secs=65.9, GB=2.78, MB/s=42.15, GPUmem=0.915805\n",
      "88.00%, ll=-6.92671, gf=33.192, secs=71.2, GB=3.00, MB/s=42.17, GPUmem=0.915805\n",
      "95.00%, ll=-6.91851, gf=33.268, secs=76.4, GB=3.23, MB/s=42.26, GPUmem=0.915805\n",
      "100.00%, ll=-6.93309, gf=33.316, secs=80.3, GB=3.40, MB/s=42.33, GPUmem=0.915805\n",
      "pass= 1\n",
      " 1.00%, ll=-6.88018, gf=32.976, secs=82.1, GB=3.44, MB/s=41.92, GPUmem=0.915805\n",
      " 7.00%, ll=-6.90460, gf=33.061, secs=87.2, GB=3.67, MB/s=42.02, GPUmem=0.915805\n",
      "14.00%, ll=-6.89758, gf=33.135, secs=92.5, GB=3.89, MB/s=42.11, GPUmem=0.915805\n",
      "21.00%, ll=-6.86670, gf=33.195, secs=97.7, GB=4.12, MB/s=42.19, GPUmem=0.915805\n",
      "28.00%, ll=-6.90028, gf=33.249, secs=103.0, GB=4.35, MB/s=42.25, GPUmem=0.915805\n",
      "34.00%, ll=-6.89525, gf=33.306, secs=108.1, GB=4.58, MB/s=42.32, GPUmem=0.915805\n",
      "41.00%, ll=-6.84500, gf=33.314, secs=113.6, GB=4.81, MB/s=42.33, GPUmem=0.915805\n",
      "48.00%, ll=-6.89398, gf=33.365, secs=118.7, GB=5.03, MB/s=42.40, GPUmem=0.915805\n",
      "54.00%, ll=-6.90187, gf=33.413, secs=123.9, GB=5.26, MB/s=42.46, GPUmem=0.915805\n",
      "61.00%, ll=-6.84352, gf=33.452, secs=129.2, GB=5.49, MB/s=42.50, GPUmem=0.915805\n",
      "68.00%, ll=-6.89512, gf=33.492, secs=134.4, GB=5.72, MB/s=42.55, GPUmem=0.915805\n",
      "75.00%, ll=-6.89321, gf=33.531, secs=139.5, GB=5.94, MB/s=42.60, GPUmem=0.915805\n",
      "81.00%, ll=-6.84816, gf=33.537, secs=144.9, GB=6.17, MB/s=42.61, GPUmem=0.915805\n",
      "88.00%, ll=-6.89672, gf=33.574, secs=150.1, GB=6.40, MB/s=42.65, GPUmem=0.915805\n",
      "95.00%, ll=-6.89115, gf=33.598, secs=155.3, GB=6.63, MB/s=42.68, GPUmem=0.915805\n",
      "100.00%, ll=-6.90773, gf=33.614, secs=159.1, GB=6.80, MB/s=42.71, GPUmem=0.915805\n",
      "Time=159.1470 secs, gflops=33.61\n"
     ]
    }
   ],
   "source": [
    "nn.train"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each training run creates a <code>results</code> matrix which is essentially a graph of the log likelihood vs number of input samples. The first row is the likelihood values, the second is the corresponding number of input samples procesed. We can plot the results here:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": []
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAfQAAAGQCAYAAABYs5LGAAAj2ElEQVR42u3db4xddZ3HcRJjjDE+8IkPjDHGZ8b4yMRHxkyaNtOUEBTS0iogdLBitThd6FKt1W67i+Ky29nWsq2s1WK3zhbpFKsIgkVBGVZmO2UBEXep1SIC7XYo2ErtMN+d77l7Z2em0z93mHbOvef1Tk5mzr135p77+9xz3r/v+d4/FwQAAGh6LjAEAAAQOgAAIHQAAEDoAACA0AEAIHQAtZ3oggvOepno9lXiox/9aGUfO0DoAKE3Pc8//3ysXbu20pMZgNCBFhO68SF0gNCBJhAWCB0gdKACQj/V7U+cOBGrVq2Kd77znfHGN74x3vzmN0dbW1vx+6n+/6n+1+m2afx1r732Wtx0003x7ne/O97whjeM3O7xxx+Pa665Jt71rneNbM/73ve++OIXvxivvPIKoQOEDhD6RLdvb29v+LT9VAj9Qx/60Em3/fa3v13I/VTbkGI/fPiwsxkAoQOtK/TJvChu69atJ0n20KFD8eKLL55zoY9f+vv7x6zfeOONRRW/fv36ky4ndIDQAUIfddn46vy+++6blJwn8zfbtm0bcwr98ssvH3P98ePHi8tT6qMvf8973kPoAKEDhD76suxPj+9rny+hj+dtb3vbWT2W7KsTOkDoQMsKfTK3n6yAz4XQT9c7f71CJnSA0IGWFvr4Cn06hf7Wt751zPX56vvpGisAhA40ldA/+MEPjrns6aefPqv/P/4tbT//+c+jr6/vdQn90ksvHXP97bffftJtjh49WryljtABQgcIfdTt8/3noy/L939nH/1HP/rRaf//e9/73il7q1udX/7yl2Ouf/vb3x7d3d1FpZ7btHv37uJta2crZJ+iBxA6UBmh53u68wNcziS90R/8kqxevfqk2+Srz1+P0JN8G92b3vSmKZEwoQOEDlRG6Mmzzz4bV155ZbzlLW8pTqV/+MMfjgcffHDMbbPXPp6lS5cWl6eAFy5cWEwOXq/Qk/3798eyZcuKdkD+/5xM5H28//3vj0WLFhWVPKEDhA7gLHjyySfHSG8yfWsAhA7gPJKf4b5hw4bi0+GSp556qqjSRws9T4UDAKEDZd4Zz3BK+qqrrjJIAAgdKDsf+9jH4gMf+EDxPvDsVefyjne8I+bPnx+7du0yQAAIHQAAQgcAAIQOAAAIHQAAEDoAAIQOAAAI/fwwNDQU69atizVr1hRfcJHLwMCAZAEAhN5M5Cdnbd++fWQ9vxWqq6tLsgAAQm8m8ksiDh06NLJ+8ODBWLBggWQBAITeTLS3t49Zz1Pw4y8DAIDQS85E3z51qm+k6u/vj5/+9Kfx/e9/f8zlH//4x+MTn/hEfO5zn4svf/nLxU/rrbfe0dFhPCqyftlllxmPiqznft2qj6+zs1OFfqYKPaU+mhw8tD4PPPCAQZA1ZN00pOArJfTFixfH4cOHR9azhz5v3ryGhJ4zIbQ+v/3tbw2CrCFrQi8r27ZtK17ZXid/X7t2bUNCb3TQAAAg9ClmMu9DV6GbyUPWkDWhtwB66NVEX1XWkDWht7jQVehm8pA1ZE3oLSB0PXQAAKGr0GEmD1lD1oReBqHroVcDfVVZQ9aErkKHmTxkDVkTetmFrocOACB0FTrM5CFryJrQyyB0PfRqoK8qa8ia0FXoMJOHrCFrQi+70PXQAQCErkKHmTxkDVkTehmErodeDfRVZQ1ZE7oKHWbykDVkTehlF7oeOgCA0FXoMJOHrCFrQi+D0PXQq4G+qqwha0JXocNMHrKGrAm97ELXQwcAELoKHWbykDVkTehlELoeejXQV5U1ZE3oKnSYyUPWkDWhl13oeugAAEIvCfv37485c+ao0KFqg6xlTejNSltb28gyGaHroVcDfVVZQ9aE3kRiV6FD1QZZy5rQKyp0PXQAAKE3kdD7+/sLmXd3d4+5vKOjozhtU5/p5U/rrbfe29trPCqy3tPTYzwqsp77das+vpYX+un65XroOB36qrKGrFXoLVKhn0roeujVQF9V1pA1obe40PXQAQCEXhKRj19U6FC1yRqyJvQWRw+9muiryhqyJvQWF7oK3UwesoasCb0FhK6HDgAgdBU6zOQha8ia0MsgdD30aqCvKmvImtBV6DCTh6wha0Ivu9D10AEAhK5Ch5k8ZA1ZE3oZhK6HXg30VWUNWRO6Ch1m8pA1ZE3oZRe6HjoAgNBV6DCTh6wha0Ivg9D10KuBvqqsIWtCV6HDTB6yhqwJvexC10MHABC6Ch1m8pA1ZE3oZRC6Hno10FeVNWRN6Cp0mMlD1pA1oZdd6HroAABCV6HDTB6yhqwJvQxC10OvBvqqsoasCV2FDjN5yBqyJvSyC10PHQBA6NPMjh07orOzM2bNmhWzZ8+ONWvWxMGDB1XoULXJGrIm9GZi6dKl0dfXF0NDQzE4OBjf+c534tprr21I6Hro1UBfVdaQNaE3GTNnzlShQ9Uma8ia0JuZPXv2FFV7I0LXQwcAEHqJeOihh+KTn/zkKXvo/f39hcy7u7vHXN7R0VGctqnP9PKn9dZb7+3tNR4VWe/p6TEeFVnP/bpVH1/LC72trW1kGc3mzZtjxYoVMTAwcMb/oYdeTfRVZQ1Zq9BLzt69e4sXxp0teujVRF9V1pA1oZeYRx555Kyq8tMJXQ8dAEDo08zoU/CnOh2vQoeqTdaQNaG3IHro1URfVdaQNaG3uNBV6GbykDVkTegtIHQ9dAAAoavQYSYPWUPWhF4GoeuhVwN9VVlD1oSuQoeZPGQNWRN62YWuhw4AIHQVOszkIWvImtDLIHQ99GqgrypryJrQVegwk4esIWtCL7vQ9dABAISuQoeZPGQNWRN6GYSuh14N9FVlDVkTugodZvKQNWRN6GUXuh46AIDQVegwk4esIWtCL4PQ9dCrgb6qrCFrQlehw0wesoasCb3sQtdDBwAQugodZvKQNWRN6GUQuh56NdBXlTVkTegqdJjJQ9aQNaGXXeh66AAAQp9m7rzzzrjuuuti1qxZ0d7eHsuXL48DBw6o0KFqkzVkTejNxNKlS6Ovry+GhoaKZceOHXHFFVc0JHQ99GqgrypryJrQm4yZM2eq0KFqkzVkTejNyuDgYOzcubM4Bd+I0PXQAQCEXhLa2tqKZdWqVXHkyJEJb9Pf31/IvLu7e8zlHR0dxWmb+kwvf1pvvfXe3l7jUZH1np4e41GR9dyvW/XxtbzQ6+LOZTwnTpwoXiS3ZMmShip0PfRqoK8qa8hahd5k5CveGxG6Hno10FeVNWRN6CVm2bJl8dhjjxWvcH/11VeLHvqKFSsaEroeOgCA0KeZe++9t3jrWr6y/aKLLoqurq44evSoCh2qNllD1oTe6uihVxN9VVlD1oTe4kJXoZvJQ9aQNaG3gND10AEAhK5Ch5k8ZA1ZE3oZhK6HXg30VWUNWRO6Ch1m8pA1ZE3oZRe6HjoAgNBV6DCTh6wha0Ivg9D10KuBvqqsIWtCV6HDTB6yhqwJvexC10MHABC6Ch1m8pA1ZE3oZRC6Hno10FeVNWRN6Cp0mMlD1pA1oZdd6HroAABCV6HDTB6yhqwJvQxC10OvBvqqsoasCV2FDjN5yBqyJvSyC10PHQBA6Cp0mMlD1pA1oZdB6Hro1UBfVdaQNaGr0GEmD1lD1oRedqHroQMACL1E3HvvvdHW1qZCh6pN1gZB1oTerOzduze+8IUvTEroeujVQF9V1pA1oZecAwcORGdnZxw7dkyFDlUbZC1rQm9GBgYGYvHixXH48OFifTJC10MHABD6NHP99dfHvn37RtZPJ/T+/v5C5t3d3WMu7+joKE7b1Gd6+dN666339vYaj4qs9/T0GI+KrOd+3aqPr+WFnsKuL+PXx193thW6Hno10FeVNWStQm8y4Z8JPfRqoq8qa8ia0Ftc6HroAABCV6HDTB6yhqwJfTrQQ68m+qqyhqwJvcWFrkI3k4esIWtCbwGh66EDAAhdhQ4zecgasib0MghdD70a6KvKGrImdBU6zOQha8ia0MsudD10AAChq9BhJg9ZQ9bVEfqOHTtKK3Q99GqgrypryJrQp4D8BLdLL720+LpTFTrM5CFryLpJhX7hhReOfBPahg0bSiV0PXQAAKE3wP333x/t7e2F1GfPnh2PP/64Ch1m8pA1ZN1sQq+T/fQZM2ac9B3medl0CF0PvRroq8oasib0Kaa7u3uMxOvLzJkzVegwk4esIeuyC/2+++4bOeU+Z86cePrpp6dlkPTQAQCEPkkuuuiikar8tttum9ZBUqGbyUPWkDWhT5IU+YIFC+LIkSPTPkh66NVEX1XWkDWhTwH33HNPaQZJhW4mD1lD1oTeAuihAwAIvQWFrkI3k4esIWtCbwGh66FXA31VWUPWhK5Ch5k8ZA1ZE3rZha6HDgAg9Glm/MfH1hcVOlRtsoasCb3JhP56K3Q99GqgrypryJrQW1zoKnQzecgasib0Egg9v9Qll7lz58amTZvi+PHjDQldDx0AQOgl4tChQ4XQV69ePeH1/f39hczz295G09HRUZy2qc/08qf11lvv7e01HhVZ7+npMR4VWc/9ulUfX8sL/UwvfBsaGopZs2Y1VKHroVcDfVVZQ9Yq9Cbi6NGjcfHFFzckdD30aqCvKmvImtBLzNKlS6Ovr6+ozFPm69atO+PXs+qhAwAIvWTs2rUrOjs7Y8aMGXHJJZfE5s2bz/g3KnQzecgasib0FkAPvZroq8oasib0Fhe6Ct1MHrKGrAm9BYSuhw4AIHQVOszkIWvImtDLIHQ99GqgrypryJrQVegwk4esIWtCL7vQ9dABAISuQoeZPGQNWRN6GYSuh14N9FVlDVkTugodZvKQNWRN6GUXuh46AIDQVegwk4esIWtCL4PQ9dCrgb6qrCFrQlehw0wesoasCb3sQtdDBwAQugodZvKQNWRN6GUQuh56NdBXlTVkTegqdJjJQ9aQNaGXXeh66AAAQlehw0wesoasCb0MQtdDrwb6qrKGrAldhQ4zecgasib0sgtdDx0AQOglYN++fXHjjTdGe3t7sajQoWqTNWRN6E3G888/H1dddVU89thjk67Q9dCrgb6qrCFrQi8xXV1d8Ytf/KKhv1Ghm8lD1pA1oZeMBQsWFFKfPXt2cbp9xYoVMTAw0JDQ9dABAIQ+zcyYMSP6+vpiaGgoBgcHY8uWLbF8+fIJb9vf31/IvLu7e8zlHR0dxWmb+kwvf1pvvfXe3l7jUZH1np4e41GR9dyvW/XxtbzQ29raRpa60EeTYp81a1ZDFboeejXQV5U1ZK1CLzEp40OHDo257MILL2xI6Hro1UBfVdaQNaGXmDx9vn79+pG++Z49e+LWW29tSOh66AAAQi8BGzduLF4Ul6faU+4nTpxQoUPVJmvImtBbHT30aqKvKmvImtBbXOgqdDN5yBqyJvQWELoeOgCA0FXoMJOHrCFrQi+D0PXQq4G+qqwha0JXocNMHrKGrAm97ELXQwcAELoKHWbykDVkTehlELoeejXQV5U1ZE3oKnSYyUPWkDWhl13oeugAAEJXocNMHrKGrAm9DELXQ68G+qqyhqwJXYUOM3nIGrIm9LILXQ8dAEDoKnSYyUPWkDWhl0HoeujVQF9V1pA1oavQYSYPWUPWhF52oeuhAwAIXYUOM3nIGrIm9DIIXQ+9GuiryhqyJnQVOszkIWvImtDLLnQ9dADnhNdeizh4MOJXv4p48MGIHTsi/vZvI/7mbyK+9708GEU88UTEH/8Y8Ze/TM82Dg5GPPdcRH9/xI9+FPHtb0d86lMR//zPte39+c8jfv3riMOHI4aGpmcb//SniN//PmLv3ojduyM2bYq45ZaI++6rbXded+wYoRO6Cr0q/C4PSr/8Ze2A8Pzz03cAHc+rr0Y8+2ztYHX//RGrV0fs3BnxyCMR//3f03sgHc+//mvErbdG/OAHtXHMbfzP/6xtZ0rhpZemdlxTiEeORPzhDzWp9PXlOdaI738/Ytu2iG98I+If/iFi1aqIG26oiWju3DjxyU9G/PVf1+S5bl1NUnfeGfHjH9e2OQWbY57/O++jETKLlGA+zvz7AwdqUsn/nduU97dyZcSnPx2xYEHEtddGfPGLEf/0T7Xx2749ors74rvfjVi/vrbtn/1sxMc+FnHNNRHLl9dklduc41x/HuT2ptj+/OfafTfynMjnWEovn/85djlua9ZEfOYztfvN+8+xuu22iF27Iv7t3yLuvjti69aItWsjvvCFiBzT+m3zOZrPg/qk5Mkna5OS3L4U69Gjtd9zeeWViJdfri35/MhlYKD2vM5l376I//qv2rblhCLHqD4u110Xcfnl2ReNWLq0ts15v7n9X/96xMaNcWzFiojOzogrrqjd7vrrI266qXabnJDkRCrzfuGFWm6j88uxzG2tb1tuz6FDES++WDtG5PPuN7+pTbryurz9edwXKyf0tra2CZdGhK6H/jrJHTgPOLnj5I6eO2UehPIAnDtFHkzOF7mz5c6YcvzhD2s7dR4YFi2KwTwYLVlSW+oHstEH0G99q3awe/jh2k78P//z+nfe/Pv8P08/HfGLX0TcdVfEN78ZcfPNEcuWRVx9de2AlZPKPFhlVfSVrxQHqvja12rblgfSFEMKIsWQB9jbb///g31uax6E6gfSPDjlwScrwxz/FMHvflc7cOZtn3qqdoDKMerpqcmoLoz8P3kQzgPeaPL/pqhSmLltOVYbNtTGLTPP7cyDb45njmtub0dHbaxTrPk4v/SlmhjytvnY/+qvao87BbF4ce1/59/nba+8MmL+/IjLLotYuLB2uzxwf/WrtQP5li21g3VKNPN6/PGI/ftr2z8s2d7M8Zlnao8tn5c5VinbrOz+/u9r25ISyP+d95MZXHVV8TwptiEvz/XcjrwuH1Pebt682pK/52X5e/6flGGOS04afvazmkDyeTh+HM/0XMncMqPe3ppYc5xze2+8sTYWuU0prvp955JjXd/+HPN8DPlcyXHNMc3H8PGP14SY45cThXvuqeWfEm5kG1OCOXF77LHa5DMnJfm8yPHM+89xyfvLY2qu55J55njmkmObSz6ncztzyb/JyVg+3//lX2q55jE6J4op1JToaRjTQ8/nbj4PcuKXjzEnBzmRyu3L59j4/OqTgNzG3K7cnpyA5W1z/PJ5l2OZ1+d1efvMIf8mb5fP4XxOZ/45scz7JfSp49DwDn1VPolU6FN/6jB3/v/4j9qBZpQoi4NJHrS7umqX5XV5oE/55E6R1+eOkAf83LH+8R8jNm+u7bhZ5eUBtz4ByCXllwe1+pKz91xSOrnkgTqXPBX40EO1iicllweFvK/c0VIyKc6cXOSBYfh5cVKvrX4Azf+d958H/TzY5Y75+c/XHlseLHPnzoNRCiWXFFI+3ry/PEiOllLedx6s8va5Lbnz5//JnT4fd4o4K5+sRlKwKd+zPS2agkgZ5+POiUGOYY5zbmv9YJoHmrzv0Qf1FE5WLbnNuR1ZPWZOeRDKba4/prxdbn/9f+UBL/9XvdLMg+3ZTs7q1Ws+ZzKznFyl5DLPXM8ssmLMA3ZOOvKx5cQnq6Ss6OqTk0ZkM4qG+qr5PMjqLbcnnw+5DZlLvdrMx5yPp17dlYncnhMnatuY25tjPnoyl/tSPs/Ktt1TSMNZT8UkPZ+bOXnM50yOcU6OcmKZ64Q+dawdPrDfl32VBoTeVD30qdwxjx+vnfrKGXceZFN8//7vtdlxVkA5s81Zcx7o8+CeVe3f/V2tasjZ7/+J8qy2KQ+MeT95UM8nfkotK6Y8fZaiyf+dE4BcUjApnvqSwsolpZNLViu55CQhq5c8fZlizwPXGWbzDZMH8TzY33tvrarNnTfHKu8rK4E83TpaSjkeOaYphPzZzH29lFg+jnxcWekCOK9UWujPDQujI085nYL+4YNSyrw7BTCK/Js8bVOf6eXPMqwfyIPoo4/GwG23xcFhgb2Wp7KGxfracBV2fPj3wTxFNFxh/WW4gvrTcEV2PCvg4du9OlyNvTRcNf45JThchQ1efXW8NHxdcf3w7fPvh4Yrx6H/O+Wcf//ysFCL2w8L/JWvfjUODoszf2bl+ofh5Wf331+68Wl0vTdPYzbx9ls/+/Wenh7jUZH13K9b9fFVWuhfHRbQQ1mpnYHS9dCzCsxTNdl7y8o4XwFb79nkizuyks2qNqvBPNWTpyTztGCeoszqKSvfvC4rxjwVnac1sxLOvmKeCspT0rme12e1mX8/yVOZzYz3Jssaslahl4hTvfBt/7Csrs3+31kw7T30lGq+ijNfAJWnjrPXmqe1swedr25OCaewMaV4b7KsIWtCbwJWrlwZe/bsmZTQz0sPPV+okn3j7BHnC43yxVTZh85q+ny+ChwAQOhl5Zlnnonrsj98lpy3Cj2r7JR2hpKn0PPtNjnpqODpbjN5yBqyJvQzsnS42v1V9ognKfQp7aHnq4LzVfbZB8+3/+R7NPO9ifnWEkwr+qqyhqwJvcWY8go9X2T2k5/U3tebEs9eeL7PuCyfRAZVm6wha0KvgtBfVw8939qVn8KUHxqSb4vK93YDAEDoTVah56eF5ccgwkwesoasCX16hT7pHnq+cj1f7NbolztgWtBXlTVkTegq9InJLyfIj0eFmTxkDVkT+vQLfdI99PzGrPyiDAAACL1JK/R8BXt+tnp+wxHM5CFryJrQp1/ok+qh5wfE5FeBomnQV5U1ZE3oKvSTye+Gzu+khpk8ZA1ZE3o5hD6pHvqnPx3x7LP2JgAAoTdthf6730V89rOebWbykDVkTehlEnrDPfQdOyI2b7YnNRn6qrKGrAldhT6W/ApUnw5nJg9ZQ9aEXi6hNzRo+W1qWdH79jQAAKE3cYX+s59F3HKLZ5qZPGQNWRN62YTeUA89v1Vt9257UROirypryJrQVeg1Bgdr33f+0kv2IjN5yBqyJvSyCf2sB+2JJyI+/3l7EACA0Ju6Qt+yJeJ73/MsM5OHrCFrQi+j0M+6h37ddRH79tmDmhR9VVlD1oSuQo947rmIT33K3mMmD1lD1oReVqGf1aDt2hXxjW/YewAAhN7UFfqqVRF9fZ5hZvKQNWRN6OeCgwcPxsqVK6O9vb1Yli1bFs/l6fEGhH7GHvrRoxFXXhlx/Li9p4nRV5U1ZE3oJWbhwoVx1113xdDQULHceeedZxR0wxX6ww9HfOUr9hwzecgasib0c8WsWbMKkY9m5syZDQn9jIO2fn3Ej39szwEAEPq5YseOHbFly5a4++6744UXXojdu3fHzTffPOFt+/v7C5l3d3ePubyjo6M4bVOf6eXPkfXhycLgcMX/8K5dE19vvWnWe3t7jUdF1nt6eoxHRdZzv27Vx1c5oafElyxZUkh87ty5RT/95fxGtAYq9NOeov/1ryNuuMFUsQXQV5U1ZK1CLxFtbW0jS7J48eJ45plnRq7fvn17dHZ2NiT00/bQt22L+O537TUtgL6qrCFrQi8xE/XLp7SHfv31Eb/5jb0GAEDo55J8m1qedk/yxXGbN2+OFStWTE2F/uKLEddcU/TRYSYPWUPWhH4OGRgYiFWrVhWvds8le+lT1kO/556Ir3/dHtMi6KvKGrIm9BbjrCv0m27Kl1DaY8zkIWvImtCbQegTDtqrr9Y+He7YMXsMAIDQm7ZCf/TRiNWrPaPM5CFryJrQm0XoE/bQN26M+MEP7C0thL6qrCFrQq9ahZ6val+0KOKPf7S3mMlD1pA1oTeL0E8atPygmjN8OA0AAIRe9gp9+/aI22/3bDKTh6wha0JvJqGf1ENfvjziySftKS2GvqqsIWtCr1KFPjAQcfXVEYOD9hQzecgasib0ZhL6mEH7yU8iurrsJQAAQm/qCv1rX4t48EHPJDN5yBqyJvRmE/pID/0vf8mViFdesZe0IPqqsoasCb0qFfrevRErV9pDzOQha8ia0JtR6COD9s1vRuzcaQ8BABB6U1foixdH/P73nkVm8pA1ZE3ozSj0oof+5z9HbNpk72hh9FVlDVkTelUqdJjJQ9aQNaE3r9AbHTQAAAhdhQ4zecgasib0cyH0Cb8PHS2HvqqsIWtCV6HDTB6yhqwJvexC10MHABC6Ch1m8pA1ZE3or5fBwcFYv359zJ49O+bMmRO33nprw0LXQ68G+qqyhqwJvcRs3Lgx9uzZU/x+5MiR2LBhQ2zZskWFDlWbrCFrQm8msiofzcDAQFxxxRUNCV0PHQBA6NPMjBkzYmhoaMxls2bNUqFD1SZryJrQm4nly5fH9u3bi156iv3RRx89pdD7+/sLmXd3d4+5/LLLLiv6MPUnRv603nrrPT09xqMi69/61reMR0XWc79u1cfX8kJva2sbWZI8xb5y5cpC4vnCuLVr18bll1+uQoeqTdaQtQq9mXnkkUdi3bp1p73N3r17x6zfcMMNxcBZLBaLxVKWpbOzs7pCf+KJJ2LhwoVx8OBB01YAQKVoCaHn6fZcli5dGvv27ZMqAIDQAQAAoTcld911V/FCOUtrL/nuBuMga4usm2UZ/3ovQj8Lxr/qHXKGrCFrFXoT0ugsCHKGrCFrQgcAAIQOAAAIHQAAQgcAAIReKvKLXPIjYtesWROrVq0qlvxceJw/Rn8u//jP6G8kr8leNx33UUX2799/0tccly0X2Z+7rO3nhH7O2bp1a/EtbXXyvYxdXV2OvudZ6FOR12Svm477qPKkbbL7n+ybO2v7OaGfcxYtWhSHDh0aWc/Pf1+wYIEjcEmFfrq8JnvddNyHrKOUucj+3GVtPyf0c057e/tJp2HGX4Zzv+PPnDmzWObOnRubNm2K48ePN5zXZK+bjvtwkI9S5iL7cyt0+zmhn/fqsJGZJKaWnPnmjr569eqG85rsddNxH/a3KGUusj+3x1b7OaGr0CtGZpDfmmfmrkKXfWsK3X5O6OeExYsXx+HDh0fWsx8yb948R95p5OjRo3HxxRc3nNdkr5uO+3CQj1LmIvvzJ3T7OaFPOdu2bStepVgnf1+7dq0j73kkv7++r6+vmOnmTp5vCbnttttGrs+dJb8J70x5Tfa683UfOPkgX4ZcZH9+srafE/o5x/tHp59du3ZFZ2dnzJgxIy655JLYvHnzyHUvvPBCcVru2LFjZ8xrstedj/sg8onffzzducj+/GVtPyd0VJz7778/brnllqa/D8gesiZ0VJqbb745nnrqqaa/D8gesiZ0AABA6AAAEDoAACB0AABA6AAAgNABACB0AABA6ACqQn5gR34m9/r16ye8vqurq7g+bweA0AGUmI0bNxYf8XnHHXeMuTw/9zovz6/KBEDoAJqA/EzrlPfDDz9crO/evbtYz8sBEDqAJuHEiROxZMmS4ruhd+7cWfz8zGc+E8ePHzc4AKEDaCaOHDkSCxYsKCrz+fPn+7Y5gNABNCMpcEIHCB1AE1M/5Z7fOZ0vjnPKHSB0AE3Il770paIyf+CBB4r1hx56qFjPywEQOoAmoP62tXyb2miyUs/L83oAhA6gxNQ/WCY/QGYi8gNnfLAMQOgAAIDQAQAgdAAAQOgAAIDQAQAAoQMAQOgAAIDQAQAAoQMAAEIHAIDQAQAAoQMAAEIHqrTTXXBBJRYAhA60vNA9RgCEDpCdxwiA0AGy8xgBQgdAdh4jQOgACB0AoQNkN0na2toIHSB0AFUR+pluNzQ0FOvWrYs1a9bEqlWrimVgYIDQAUIHCL2ZhL5169bYvn37yHp3d3d0dXUROkDoAKFPpbR37NgR8+fPj/b29lixYkW8/PLLY0Q9ODgY69evj9mzZxdLVtt5Wf02o5eJWLRoURw6dGhk/eDBg7FgwQJCBwgdqLjQH3ggYtWqyS35t+OEvm3btjh69Ghxajzlvnbt2jFC37RpU9xxxx3F9bns3LkzNm7ceNYVek4URpP/Y/xlhA4QOlA9ob/4YsQTT0xuyb8dJ/TRZOU9Z86cMdddeumlIxV5cuLEibjkkkvOWugTXT/RZYQOEDpQLaFPIROJdebMmWOuO91tVOgAoQMoodCzEv/IRz4y5rqsxsdX6PXbnI3QFy9eHIcPHx5Zzx76vHnzCB0gdIDQp1Lo+/btG6mcf/jDH8aGDRvGiDp76PnK9HoPPfvpo3voF110URw4cOCU95E9+vz7Ovl7vU9P6AChA4Q+RUJfunRpzJo1q3gFe76a/fjx42OEntV5vrI9b5NL/p5Vep2777575BXwE+F96AChAzgPQm/1xwiA0IGWl11W3IQOEDoAsvMYAUIHQHaEDhA6QOgeIwBCB5pNdlVYABA6AABokP8Fzq5Qcvd/kyEAAAAASUVORK5CYII=",
      "text/plain": [
       "BufferedImage@18c30d65: type = 2 DirectColorModel: rmask=ff0000 gmask=ff00 bmask=ff amask=ff000000 IntegerInterleavedRaster: width = 500 height = 400 #Bands = 4 xOff = 0 yOff = 0 dataOffset[0] 0"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "plot(nn.results(1,?), nn.results(0,?))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To evaluate the model, we save the model matrix itself, and also load a dictionary of the terms in the corpus."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": []
    },
    {
     "data": {
      "text/plain": [
       "BIDMat.Dict@7a7e6283"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "val tmodel = FMat(nn.modelmat)\n",
    "val dict = Dict(loadSBMat(mdir+\"../pubmed.term.sbmat.lz4\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The dictionary allows us to look up terms by their index, e.g. <code>dict(1000)</code>, by their string represenation <code>dict(\"book\")</code>, and by matrices of these, e.g. <code>dict(ii)</code> where <code>ii</code> is an IMat. Try a few such queries to the dict here:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we evaluate the entropy of each dimension of the model. Recall that the entropy of a discrete probability distribution is $E = -\\sum_{i=1}^n p_i \\ln(p_i)$. The rows of the matrix are the topic probabilities.\n",
    "\n",
    "Compute the entropies for each topic:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": []
    },
    {
     "data": {
      "text/plain": [
       "5.1114,5.2976,4.6267,5.2456,4.3975,5.1331,4.5964,5.2893,4.7173,5.4398,5.8625,4.1256,4.5969,3.5420,4.7882,5.2300,4.4556,2.7039,3.1350,5.6754,3.5055,5.6651,5.3590,3.8807,2.3031,4.4372,3.9389,5.7394,4.4904,5.1992,6.1797,3.9195,2.3029,5.4791,4.9826,2.7610,5.2653,5.5982,1.4048,3.4634,4.4630,5.7733,2.2171,4.0891,4.4084,5.4493,6.0265,4.1399,5.8458,4.9230,5.2576,5.4123,5.9085,5.4099,3.9270,5.8947,4.5297,4.0587,5.6131,5.0017,4.4420,5.7174,5.1036,4.2589,5.6449,5.0591,3.8745,1.8975,5.8035,5.0307,4.4231,1.3162,5.2058,2.5560,5.0222,5.3514,5.6441,4.8174,2.9534,5.2721,5.2945,5.0691,5.0058,2.9747,5.8789,5.6344,5.1950,4.5837,5.4611,5.2889,5.3010,4.7849,6.6542,6.6387,5.3950,5.3466,4.6309,1.9564,4.5986,4.9530,5.0361,6.4518,3.5818,2.7613,4.5762,5.4378,4.5264,4.8327,5.6308,5.1502,5.7634,5.4617,4.0019,5.1501,4.0902,5.9191,5.6831,5.2341,2.7303,4.6649,5.2866,5.1677,4.9706,5.2402,5.1934,3.7595,5.6273,4.0434,4.0109,4.7617,5.1109,5.5546,4.8784,4.1887,3.7054,5.4612,5.2764,4.8451,5.8668,4.8190,4.7739,5.2897,3.9158,5.7511,4.9064,6.3467,5.1945,5.7916,4.3248,6.3391,5.3953,4.3254,4.0941,3.6181,4.9485,5.8006,5.3474,3.7264,4.6339,6.9143,4.5528,4.4066,4.4110,5.1522,5.0870,5.0149,5.6603,4.5497,4.3591,4.6512,5.4001,5.1460,5.8381,4.8695,5.8047,5.2049,4.6524,5.3800,6.0958,4.6190,5.4246,4.5809,4.9653,4.3739,4.2059,3.8247,4.2678,4.6395,4.4611,3.1075,2.8976,5.3807,4.6884,4.8215,5.2609,5.5787,4.9535,5.3556,4.3266,3.6062,5.0851,5.7960,5.7255,4.5949,5.4241,5.8773,5.5728,3.7411,4.9058,5.0009,5.5011,4.2409,4.5751,1.3331,3.6396,4.5907,5.0229,5.0792,5.8104,5.7106,3.6759,6.0295,3.8709,5.5567,3.2505,3.5519,2.4837,5.1221,6.2520,3.4415,5.6699,2.3538,3.5475,5.3747,4.5216,5.0225,5.0951,4.7469,5.2699,5.1190,4.4600,4.4932,3.9057,4.8174,2.8009,4.4684,2.4636,5.0251,4.3337,5.0809,5.7879,5.3045,3.3504,5.4780,5.2759,4.8923"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "val ent = -(tmodel dotr ln(tmodel))\n",
    "ent.t // put them in a horizontal line"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get the mean value (should be positive)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": []
    },
    {
     "data": {
      "text/plain": [
       "4.7454"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mean(ent)  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Find the smallest and largest entropy topic indices (use maxi2 and mini2). Call them <code>elargest</code> and <code>esmallest</code>."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": []
    },
    {
     "data": {
      "text/plain": [
       "71"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "val (vlargest,elargest) = maxi2(ent)\n",
    "val (vsmallest,esmallest) = mini2(ent)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we'll sort the probabilities within each topic to bring the highest probability terms to the beginning. We sort down (descending order) along dimension 2 (rows) to do this. <code>bestv</code> gets the sorted values and <code>besti</code> gets the sorted indices which are the feature indices."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": []
    },
    {
     "data": {
      "text/plain": [
       "   284   901   716   696   546     2  1342  2119  1120  3749   735...\n",
       "  1244  1414  2143  2309  3013  2058  2541  4052  4272   649  6756...\n",
       "   127   398   397   401  1109   178   977   115  2521  1605  1998...\n",
       "   466  1117  2414  2266  4043  2892  3263  3104  5788  4002    43...\n",
       "   675  1387  2321  3212  5436  4943  5288  3942  4564    19  8242...\n",
       "   402   533   999  1072   957  2558  3145     2  3532  2680  4939...\n",
       "  2214  2666  1673  4530  3199  3119  3694  5270  5069  6945  3868...\n",
       "   494   550   766  1486  2527  6144   267  2929  4068    18  2735...\n",
       "    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..    ..\n"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "val (bestp, besti) = sortdown2(tmodel,2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now examine the 100 strongest terms in each topic:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": []
    },
    {
     "data": {
      "text/plain": [
       "hospital,program,care,health,cost,services,health_care,need,community,nursing,service,public,decision,plan,policy,benefit,planning,economic,organization,new,nurses,impact,management,national,goal,facilities,resources,issues,personnel,law,institution,development,home,process,article,mental_health,nurse,professional,unit,provide,under,legal,financial,effort,pharmacy,social,demand,provision,quality,strategies,medicare,project,utilization,federal,implementation,payment,staff,act,center,setting,effective,requirement,provider,market,public_health,government,ethical,health_services,private,prevention,major,individual,intervention,meet,policies,responsibility,objectives,regulation,medical,standard,court,provided,facility,insurance,current,delivery,medical_care,part,product,implication,person,based,concern,providing,research,future,agencies,saving,homes,change"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dict(besti(elargest,0->100))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": []
    },
    {
     "data": {
      "text/plain": [
       "protein,basic,binding_protein,acidic,cap,mbp,major,s-100,amount,myelin,bound,scrapie,bind,ubiquitin,proteolipid,cellular,retinol-binding,biochemical,specifically,protein-bound,protein_level,known,acid-binding,milligram,function,appear,s100,including,constituent,involved,pmol/mg,prion,gfa,saf,flagellin,prealbumin,unknown,abundant,immunoblot,kilodalton,macromolecules,general,common,nmol/mg,lowry,show,immunologically,identification,product,variety,possible,rich,g_protein,scrapie-infected,hsp90,immunochemical,origin,abundance,fut-175,suggesting,previously,p55,identify,include,encephalitogenic,2-glycoprotein,least,unique,association,c4b-binding,cysteine-rich,called,properties,intrinsic,comparison,immunoblotting,absent,anti-mbp,share,scrapie-associated,tightly,exception,analyzed,counterpart,protein-containing,protein-free,closely,bacterially,protein_expression,protein-depleted,biuret,70-kda,globular,protein-specific,dm-20,constitute,belong,individual,identical,termed"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dict(besti(esmallest,0->100))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Do you notice any difference in the coherence of these two topics?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> TODO: Fill in your answer here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By sorting the entropies, find the 2nd and 3rd smallest entropy topics. Give the top 100 terms in each topic below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": []
    },
    {
     "data": {
      "text/plain": [
       "system,reference,special,variability,laboratories,tf,identification,component,api,micro,collaborative,functioning,interlaboratory,subsystem,pmc,biochemical,consist,connected,capable,20e,operating,variabilities,generating,inter-,participating,apparatus,link,automicrobic,consisting,same-day,robot,commercial,md.,robotic,system&quot,candidate,misidentification,controlling,inter-laboratory,utilizes,20s,cusum,developed,microscan,autobac,vitek,analytab,bbl,coupled,zym,product,operate,essential,controlled,invest,lab,minitek,studying,equivalency,intralaboratory,n.y.,dual,mo.,clin,parallel,concerned,aea,comparative,mentioned,stingray,misidentified,ensures,sop,interrelated,serve,tfd,inc.,functioned,serves,regulated,participated,manufacturer,built-in,model_system,microsystem,inflexible,keyed,micro-id,ampule,hazelwood,cockeysville,dli,specialized,realized,accordance,turned,ensured,non-commercial,self-regulating,assigned"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "val (sent, ient) = sort2(ent)\n",
    "// words for 2nd lowest entropy topic\n",
    "dict(besti(ient(1),0->100))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": []
    },
    {
     "data": {
      "text/plain": [
       "activity,activities,enzyme_activity,enzymes,measured,enzyme_activities,exhibited,assayed,highest,lower,biochemical,homogenates,correlated,enzymatic,enzymatic_activity,exhibit,tested,displayed,regulation,restored,functional,transferase,high_level,markedly,detected,relation,possessed,measuring,intrinsic,possess,show,enzymic,detectable,characterized,lowest,slightly,hand,devoid,parallel,weak,comparable,paralleled,u/mg,regulated,relative,relationship,marked,residual,significantly_higher,possesses,maximum,biochemically,reflect,tpp,dependent,unchanged,low_level,strong,showing,hydrolytic,remained,possessing,exhibiting,10-fold,4-fold,responsible,maximal,2-fold,affected,lowered,considerable,estimated,biologic,lysates,greatest,activity.abstract,correlate,regard,mu/mg,non-specific,respect,correlates,nearly,modulated,lacked,relatively_high,significant_increase,moderate,3-fold,comparison,strongly,fourfold,significant_decrease,decreasing,demonstrable,depended,respective,indicator,distinctly,5-fold"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "// words for 3rd lowest entropy topic\n",
    "dict(besti(ient(2),0->100))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Running more topics"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What would you expect to happen to the average topic entropy if you run fewer topics? "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> TODO: answer here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Change the opts.dim argument above and try it. First note the entropy at dim = 256 below. Then run again with <code>dim=64</code> and put the new value below: "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<table>\n",
    "<tr>\n",
    "<th>dim</th>\n",
    "<th>mean entropy</th>\n",
    "</tr>\n",
    "<tr>\n",
    "<td>64</td>\n",
    "<td>...</td>\n",
    "</tr>\n",
    "<tr>\n",
    "<td>256</td>\n",
    "<td>...</td>\n",
    "</tr>\n",
    "</table>\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "scala",
   "version": "2.11.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
